Chromatin Immunoprecipitation Sequencing ◾ 215
Library preparation for ChIP-Seq DNA fragments follows the same steps as for whole
genome sequencing (WGS): fragmentation, end repair, adaptor ligation, and enrichment.
Sequencing itself proceeds as for any DNA library on the chosen sequencing platform,
and the raw output consists of millions of ChIP-Seq reads.
The sequencing strategies used for ChIP-Seq are the same as those used for WGS and
RNA-Seq. The design of a ChIP-Seq experiment is usually tailored to the condition being
studied, and that design guides the subsequent data analysis. The sequencer produces raw
reads in FASTQ files. Sequencing can be single end or paired end, with short reads (e.g.,
Illumina) or long reads (e.g., PacBio). However, most ChIP-Seq datasets have been
generated from single-end libraries, and we should be aware that some programs do not
support paired-end data. A run can contain a single sample or be multiplexed for several
samples; in a multiplexed run, the fragments of each sample are tagged with a unique barcode.
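To make the idea of barcoded, multiplexed reads concrete, the following is a minimal Python sketch that groups FASTQ records by sample. It rests on several assumptions that are not part of the text above: the barcode (index) is taken to be the last colon-separated field of the read header, as in common Illumina-style headers; the function names, the sample sheet, and the simulated reads are all hypothetical. In practice, demultiplexing is normally done by the sequencing facility's own software before the FASTQ files are delivered.

```python
import io
from collections import defaultdict
from typing import Dict, Iterator, List, TextIO, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (an empty line also stops the sketch)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def barcode_of(header: str) -> str:
    """Return the index sequence, ASSUMED to be the last ':'-separated header field."""
    return header.rsplit(":", 1)[-1]


def group_by_barcode(
    handle: TextIO, sample_sheet: Dict[str, str]
) -> Dict[str, List[FastqRecord]]:
    """Assign each read to a sample name; unknown barcodes go to 'undetermined'."""
    groups: Dict[str, List[FastqRecord]] = defaultdict(list)
    for record in read_fastq(handle):
        sample = sample_sheet.get(barcode_of(record[0]), "undetermined")
        groups[sample].append(record)
    return dict(groups)


# Simulated reads (not real data): two ChIP reads, one input read, one unknown barcode.
example = io.StringIO(
    "@SIM:1:FC1:1:1101:1000:2000 1:N:0:ACGTAC\nGATTACAGATTACA\n+\nIIIIIIIIIIIIII\n"
    "@SIM:1:FC1:1:1101:1001:2001 1:N:0:TTGCAA\nCCGGAATTCCGGAA\n+\nIIIIIIIIIIIIII\n"
    "@SIM:1:FC1:1:1101:1002:2002 1:N:0:ACGTAC\nTTTTGGGGCCCCAA\n+\nIIIIIIIIIIIIII\n"
    "@SIM:1:FC1:1:1101:1003:2003 1:N:0:GGGGGG\nACACACACACACAC\n+\nIIIIIIIIIIIIII\n"
)

# Hypothetical sample sheet mapping barcodes to sample names.
sample_sheet = {"ACGTAC": "chip", "TTGCAA": "input"}

groups = group_by_barcode(example, sample_sheet)
for sample in sorted(groups):
    print(sample, len(groups[sample]))
```

Reads whose barcode is not in the sample sheet are kept under a separate label rather than discarded, so that an unexpectedly large unassigned fraction can be noticed.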
6.3 CHIP-SEQ ANALYSIS WORKFLOW
In general, the ChIP-Seq analysis workflow includes raw data acquisition, quality control,
read alignment, alignment quality control, peak calling, combining peak calls, and final
analysis (visualization, motif discovery, and annotation and functional enrichment). You
are already familiar with the first four steps, which were discussed in detail in Chapters 1
and 2. ChIP-Seq raw data can either be provided by the sequencing facility for an
experiment or be downloaded from a database. In either case, you may need to preprocess
the FASTQ files (refer to Chapter 1). Several databases host ChIP-Seq raw data that
other researchers have submitted, either as supplementary material for their publications
or as part of a project dedicated to investigating particular conditions whose data are
made public. The NCBI SRA is the most commonly used database for these purposes, and it
integrates most of the other public databases. FASTQ files are downloaded from the NCBI
SRA database using the SRA Toolkit, which we used in the previous chapters. ChIP reads
can be subjected to quality control by following the steps discussed in Chapter 1. The
FastQC program is used to assess the quality of the reads in the files. The reads can
then be preprocessed, if needed, to remove low-quality reads, trim low-quality ends or
adaptor sequences, and remove duplicate reads and other technical reads. As in other
sequencing applications, quality control is crucial to avoid misleading interpretation
of results. Read mapping
is performed by aligning ChIP reads and control reads to a reference genome. The same
aligners (e.g., BWA and Bowtie) used for aligning reads from WGS can also be used for
ChIP-Seq data (refer to Chapter 2). The alignment information produced by the aligner
is stored in SAM/BAM files. Before proceeding to the next step, we can remove duplicate
reads. Duplicate reads are copies of a single original fragment, typically produced by PCR
amplification; they are identical and align to the same position, which indicates low
library complexity in that region. For ChIP-Seq data, reads align upstream and downstream
of the binding sites, leaving the binding site itself with low sequence coverage. The read
pileup density (also called signal) around a binding site should therefore form a bimodal
enrichment pattern, with Watson (forward) strand tags enriched upstream of the binding site
and Crick (reverse) strand tags enriched downstream. The shape